Loading required packages, source files and data.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(highcharter)
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(ggplot2)

source('../unbalanced_functions.R')
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
## Loading required package: lattice
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
data <- read_csv('../data/murder_suicide.csv')
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Sex = col_character(),
##   MaritalStatus = col_character(),
##   InjuryAtWork = col_character(),
##   MethodOfDisposition = col_character(),
##   Autopsy = col_character(),
##   Icd10Code = col_character()
## )
## See spec(...) for full column specifications.

Based on the US mortality data, answer the questions below.


1. From the variables of the mortality data, select a non-redundant subset. Compute and plot the correlation matrix of these variables. In addition, draw the pairwise scatter plot of the selected variables.


To obtain a non-redundant set of attributes, we started by analysing them semantically, removing the derived (calculated) ones, those taking only a single value, and those in which a single value was absolutely dominant.

We also removed artificially added attributes with no semantic content, such as Id.
The result is the set non_redundant_colnames.

To minimize the set further, we omitted attributes with erratic behaviour, such as MonthOfDeath and DayOfWeekOfDeath.
We also dropped attributes with very little absolute correlation to the others, such as HispanicOriginRaceRecode.
The resulting set is minified_non_redundant_colnames.
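The weak-correlation screening described above was done by inspection; it can also be sketched programmatically. The helper below is illustrative only (its name and the 0.1 threshold are our own choices, not part of the original analysis), and it assumes a fully numeric data frame such as the q1_nr_data built further down:

```r
# Sketch: flag attributes whose strongest absolute correlation with any
# other attribute is below a threshold (0.1 here is an illustrative choice).
weakly_correlated <- function(df, threshold = 0.1) {
  cm <- cor(df, use = 'pairwise.complete.obs')
  diag(cm) <- NA                              # ignore self-correlation
  max_abs <- apply(abs(cm), 1, max, na.rm = TRUE)
  names(max_abs)[max_abs < threshold]         # candidates for removal
}
```

Columns returned by such a scan are candidates for manual review, not automatic removal, since a low pairwise correlation does not rule out a nonlinear relation.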

We then drew three different correlation plots for the resulting data; the third one also incorporates a significance (confidence-level) analysis.

Finally, we drew the pairwise scatter plot for every pair of the 14 remaining attributes (on the first 500 rows, for legibility).

###############################################################################################
## Q1

data %>% colnames -> all_colnames       # column names of the raw data (avoids masking base::colnames)
data %>% filter(AgeType == 1) -> data   # keep only records with age recorded in years

all_colnames %>% setdiff(c('EducationReportingFlag', 'AgeType', 'AgeRecode12',
                           'AgeRecode27', 'AgeRecode52', 'InfantAgeRecode22',
                           'RaceImputationFlag', 'RaceRecode3', 'RaceRecode5',
                           'BridgedRaceFlag', 'CurrentDataYear', 'AgeSubstitutionFlag',
                           'CauseRecode358', 'CauseRecode113', 'InfantCauseRecode130',
                           'HispanicOrigin', 'Id')) -> non_redundant_colnames

non_redundant_colnames %>% setdiff(c('HispanicOriginRaceRecode', 'NumberOfEntityAxisConditions',
                                     'NumberOfRecordAxisConditions', 'DayOfWeekOfDeath', 'Race',
                                     'MonthOfDeath', 'PlaceOfDeathAndDecedentsStatus')) -> minified_non_redundant_colnames

data %>% select(non_redundant_colnames) %>%
  mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) -> q1_nr_data

data %>% select(minified_non_redundant_colnames) %>%
  mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) -> minified_q1_nr_data


minified_q1_nr_data %>% cor(use = 'pairwise.complete.obs') -> q1_cor
q1_cor %>% hchart()
library(corrplot)
## corrplot 0.84 loaded
minified_q1_nr_data %>% cor.mtest(conf.level = .95) -> res1
q1_cor %>% corrplot(type = 'upper')

q1_cor %>% corrplot(type = 'upper', p.mat = res1$p, sig.level = 1e-90, pch.cex = 1.5)

minified_q1_nr_data %>% slice(1:500) %>% plot()


2. Evaluate the effect of each of the variables sex, race, education, age, and method of disposition on murder versus suicide.


To test the null hypothesis that these variables have no effect on the manner of death, we used the non-parametric Kruskal-Wallis rank-sum test (whose statistic is chi-squared distributed).
To better understand how each of these variables relates to the manner of death, we also visualized the relations with appropriate diagrams.

The results are somewhat interesting!

First, the null hypothesis is confidently rejected for every one of the Sex, Race, Education, Age and MethodOfDisposition variables.

The detailed tests and plots follow:

###############################################################################################
## Q2

# cbind(
#   data %>% select(MethodOfDisposition),
#   data %>% mutate_if(is.character, factor) %>% mutate_if(is.factor, as.numeric) %>% select(MethodOfDisposition)
# ) %>% View()
q1_nr_data %>% mutate(ms = ifelse(MannerOfDeath %in% c(2, 6), 'Suicide',
                                  ifelse(MannerOfDeath == 3, 'Murder', 'Other'))) -> q2_data

############################

male_count <- q2_data %>% filter(Sex == 2) %>% count() %>% as.integer()
female_count <- q2_data %>% filter(Sex != 2) %>% count() %>% as.integer()

q2_data %>% group_by(Sex, ms) %>% summarise(count = n()) %>% ungroup() %>% 
  mutate(Sex = ifelse(Sex == 2, 'Male', 'Female')) %>%
  mutate(ratio = ifelse(Sex == 'Male', count/male_count, count/female_count)) %>% 
  hchart(type = 'column', hcaes(x = ms, y = ratio, group = Sex))
kruskal.test(ms ~ Sex, data = q2_data) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by Sex
## Kruskal-Wallis chi-squared = 33.535, df = 1, p-value = 7.001e-09
############################

race_counts <- q2_data %>% group_by(Race) %>% summarise(race_count = n())
race_names <- read_csv('../data/Race.csv')
## Parsed with column specification:
## cols(
##   Code = col_integer(),
##   Description = col_character()
## )
full_join(
  race_counts,
  race_names %>% rename(Race = Code),
  by = 'Race'
) %>% full_join(
  q2_data,
  by = 'Race'
) %>% group_by(Race, ms, race_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
  mutate(ratio = count/race_count) %>% .[complete.cases(.),] %>% 
  hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Race, data = q2_data) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by Race
## Kruskal-Wallis chi-squared = 15366, df = 13, p-value < 2.2e-16
############################

q2_data_1989 <- q2_data %>% filter(Education1989Revision != 0)
q2_data_2003 <- q2_data %>% filter(Education2003Revision != 0)


edu_1989_names <- read_csv('../data/Education1989Revision.csv')
## Parsed with column specification:
## cols(
##   Code = col_integer(),
##   Description = col_character()
## )
edu_1989_names %>% mutate(Description = ifelse(Description == 'Years of elementary school', paste(Code, Description, sep = ' '), Description)) -> edu_1989_names

full_join(
  q2_data %>% group_by(Education1989Revision) %>% summarise(edu_count = n()) %>% filter(Education1989Revision != 0),
  edu_1989_names %>% rename(Education1989Revision = Code),
  by = 'Education1989Revision'
) %>% full_join(
  q2_data_1989,
  by = 'Education1989Revision'
) %>% group_by(Education1989Revision, ms, edu_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
  mutate(ratio = count/edu_count) %>% .[complete.cases(.),] %>% 
  hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Education1989Revision, data = q2_data_1989) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by Education1989Revision
## Kruskal-Wallis chi-squared = 292.21, df = 17, p-value < 2.2e-16
############################

edu_2003_names <- read_csv('../data/Education2003Revision.csv')
## Parsed with column specification:
## cols(
##   Code = col_integer(),
##   Description = col_character()
## )
full_join(
  q2_data %>% group_by(Education2003Revision) %>% summarise(edu_count = n()) %>% filter(Education2003Revision != 0),
  edu_2003_names %>% rename(Education2003Revision = Code),
  by = 'Education2003Revision'
) %>% full_join(
  q2_data_2003,
  by = 'Education2003Revision'
) %>% group_by(Education2003Revision, ms, edu_count, Description) %>% summarise(count = n()) %>% ungroup() %>%
  mutate(ratio = count/edu_count) %>% .[complete.cases(.),] %>% 
  hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
kruskal.test(ms ~ Education2003Revision, data = q2_data_2003) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by Education2003Revision
## Kruskal-Wallis chi-squared = 3315.7, df = 8, p-value < 2.2e-16
############################

full_join(
  q2_data %>% filter(Age < 250) %>% group_by(Age) %>% summarise(age_count = n()),
  q2_data,
  by = 'Age'
) %>% 
  group_by(Age, ms, age_count) %>% summarise(count = n()) %>% ungroup() %>%
  mutate(ratio = count/age_count) %>% .[complete.cases(.),] %>%
  hchart(type = 'spline', hcaes(x = Age, y = ratio, group = ms))
q2_data %>% filter(Age < 250) %>%
  ggplot(aes(y = Age, x = ms)) + geom_boxplot(notch=TRUE, outlier.colour="red", outlier.shape=8)

kruskal.test(ms ~ Age, data = q2_data) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by Age
## Kruskal-Wallis chi-squared = 6313.6, df = 103, p-value < 2.2e-16
############################

disposition_names <- read_csv('../data/MethodOfDisposition-custom.csv')
## Parsed with column specification:
## cols(
##   Code = col_character(),
##   Description = col_character()
## )
disposition_count <- q2_data %>% group_by(MethodOfDisposition) %>% summarise(dispo_count = n())

cbind(
  disposition_names,
  disposition_count
) %>% full_join(
  q2_data,
  by = 'MethodOfDisposition'
) %>% group_by(ms, Description, dispo_count) %>% summarise(count = n()) %>% ungroup() %>% 
  mutate(ratio = count/dispo_count) -> q2_dispo_data

q2_dispo_data %>% hchart(type = 'column', hcaes(x = Description, y = ratio, group = ms))
q2_dispo_data %>% hchart(type = 'column', hcaes(x = Description, y = count, group = ms))
kruskal.test(ms ~ MethodOfDisposition, data = q2_data) %>% print
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ms by MethodOfDisposition
## Kruskal-Wallis chi-squared = 4407.9, df = 6, p-value < 2.2e-16

3. Fit a logistic regression model to the data and then diagnose it.


First, we transform the MannerOfDeath attribute into a binary indicator that is TRUE for suicide and FALSE otherwise.
We then fit a logistic regression model on the mutated minified data from Q1.
Diagnosis of this model suggests that the Education1989Revision and InjuryAtWork attributes may have no significant effect on MannerOfDeath (p > 0.1).
Using this diagnosis, we improve the model by removing these two variables.
The analysis of the final model can be found below.

###############################################################################################
## Q3

minified_q1_nr_data %>% mutate(MannerOfDeath = MannerOfDeath %in% c(2, 6)) -> q3_data
q3_data %>% filter(Age < 250) -> q3_data

glm(MannerOfDeath~., data = q3_data, family = binomial(link = 'logit')) -> q3_logit_model_1
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q3_logit_model_1 %>% summary.glm() %>% print
## 
## Call:
## glm(formula = MannerOfDeath ~ ., family = binomial(link = "logit"), 
##     data = q3_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.0712  -0.0702   0.0885   0.2629   8.4904  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           45.637315   1.399101  32.619  < 2e-16 ***
## ResidentStatus         0.109321   0.031126   3.512 0.000444 ***
## Education1989Revision  0.008698   0.005362   1.622 0.104804    
## Education2003Revision  0.180788   0.011872  15.228  < 2e-16 ***
## Sex                    0.354360   0.044982   7.878 3.33e-15 ***
## Age                    0.017591   0.001111  15.829  < 2e-16 ***
## MaritalStatus         -0.153787   0.020146  -7.634 2.28e-14 ***
## InjuryAtWork          -0.071879   0.055300  -1.300 0.193674    
## MethodOfDisposition    0.119863   0.014074   8.517  < 2e-16 ***
## Autopsy               -1.400189   0.042340 -33.070  < 2e-16 ***
## ActivityCode          -0.284778   0.004746 -60.000  < 2e-16 ***
## PlaceOfInjury         -0.168226   0.002653 -63.414  < 2e-16 ***
## Icd10Code             -0.239010   0.003040 -78.620  < 2e-16 ***
## CauseRecode39          0.330360   0.044565   7.413 1.24e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 70507  on 59695  degrees of freedom
## Residual deviance: 18947  on 59682  degrees of freedom
## AIC: 18975
## 
## Number of Fisher Scoring iterations: 14
glm(MannerOfDeath~.-InjuryAtWork-Education1989Revision, data = q3_data, binomial(link = 'logit')) -> q3_logit_model_2
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q3_logit_model_2 %>% summary.glm() %>% print
## 
## Call:
## glm(formula = MannerOfDeath ~ . - InjuryAtWork - Education1989Revision, 
##     family = binomial(link = "logit"), data = q3_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.0807  -0.0702   0.0889   0.2632   8.4904  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           45.522912   1.395182  32.629  < 2e-16 ***
## ResidentStatus         0.107518   0.031107   3.456 0.000548 ***
## Education2003Revision  0.176848   0.011546  15.317  < 2e-16 ***
## Sex                    0.352952   0.044970   7.849 4.20e-15 ***
## Age                    0.017624   0.001110  15.882  < 2e-16 ***
## MaritalStatus         -0.152723   0.020145  -7.581 3.43e-14 ***
## MethodOfDisposition    0.131316   0.011303  11.617  < 2e-16 ***
## Autopsy               -1.395587   0.042101 -33.149  < 2e-16 ***
## ActivityCode          -0.284842   0.004748 -59.995  < 2e-16 ***
## PlaceOfInjury         -0.168442   0.002647 -63.634  < 2e-16 ***
## Icd10Code             -0.239098   0.003038 -78.702  < 2e-16 ***
## CauseRecode39          0.331384   0.044532   7.442 9.95e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 70507  on 59695  degrees of freedom
## Residual deviance: 18945  on 59684  degrees of freedom
## AIC: 18969
## 
## Number of Fisher Scoring iterations: 13

4. Evaluate the model output against the real data using three plots.


We compared the model's output with the real data using four diagrams.
The results follow:

###############################################################################################
## Q4

q3_data %>% mutate(prediction = fitted(q3_logit_model_2)) %>%
  ggplot(aes(x = Age, y = prediction, color = MannerOfDeath)) + geom_point()

q3_data %>% mutate(prediction = predict(q3_logit_model_2, type = 'response')) %>%
  mutate(MannerOfDeath = as.integer(MannerOfDeath)) %>%
  ggplot(aes(x = Age, y = MannerOfDeath)) + 
  geom_point(aes(x =  Age, y = prediction), color = 'red', alpha = 0.2) + 
  geom_point(alpha = 0.005, size = 3)

library(ggthemes)

q3_data %>% mutate(prediction = predict(q3_logit_model_2, type = 'response')) %>%
  ggplot(aes(fitted(q3_logit_model_2), color = as.factor(MannerOfDeath))) + 
  geom_density(size = 0.5) +
  ggtitle("Training Set's Predicted Score") + 
  scale_color_economist(name = "data", labels = c("negative", "positive"))

table(q3_data$MannerOfDeath,ifelse(fitted(q3_logit_model_2)>0.3,1,0)) %>% plot()


5. First set aside a random 20% of the data as a test set. Fit the model on the remaining 80%. Using a cutoff parameter of 0.5, predict the outcomes for the test data, then compute the following quantities.

As discussed in class, draw a representation of the four quantities TN, TP, FP and FN together with the data.


First, the data is randomly divided into a 20% test set and an 80% training set. The model is then fitted on the training set, and the requested statistics are computed for it.
Next, the mosaic plot is drawn to show the model accuracy and the type I and type II errors.
Then the confusion-matrix plot is drawn; it is based on much the same idea as the mosaic plot.
After that, the training/test accuracy diagram is drawn for different cutoffs, at two different levels of magnification.

###############################################################################################
## Q5
q3_data %>% nrow -> n_data
q5_sample <- sample(1:n_data, size = 0.8*n_data, replace = FALSE)

q3_data %>% mutate(MannerOfDeath = as.integer(MannerOfDeath)) %>% select(-Education1989Revision,-InjuryAtWork) -> q5_data



q4_model_data <- q5_data %>% .[q5_sample,]
q4_test_data <- q5_data %>% .[-q5_sample,]

glm(MannerOfDeath~., data = q4_model_data, family = 'binomial') -> q4_logit_model
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
q4_test_data$prediction   <- predict.glm(q4_logit_model, newdata = q4_test_data  , type = "response")
q4_model_data$prediction  <- predict.glm(q4_logit_model, newdata = q4_model_data , type = "response")
q4_logit_model %>% summary.glm()
## 
## Call:
## glm(formula = MannerOfDeath ~ ., family = "binomial", data = q4_model_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -8.49    0.00    0.00    0.00    8.49  
## 
## Coefficients:
##                         Estimate Std. Error    z value Pr(>|z|)    
## (Intercept)            1.725e+15  1.447e+07  1.192e+08   <2e-16 ***
## ResidentStatus         1.669e+14  5.730e+05  2.913e+08   <2e-16 ***
## Education2003Revision  4.180e+13  1.827e+05  2.288e+08   <2e-16 ***
## Sex                   -3.716e+12  7.470e+05 -4.974e+06   <2e-16 ***
## Age                    1.113e+12  1.737e+04  6.409e+07   <2e-16 ***
## MaritalStatus         -2.187e+13  3.134e+05 -6.978e+07   <2e-16 ***
## MethodOfDisposition    3.094e+13  1.957e+05  1.582e+08   <2e-16 ***
## Autopsy               -4.483e+14  3.453e+05 -1.298e+09   <2e-16 ***
## ActivityCode          -1.615e+13  4.048e+04 -3.990e+08   <2e-16 ***
## PlaceOfInjury         -2.071e+13  3.234e+04 -6.404e+08   <2e-16 ***
## Icd10Code             -2.181e+13  3.085e+04 -7.069e+08   <2e-16 ***
## CauseRecode39          1.090e+14  4.642e+05  2.347e+08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  56257  on 47755  degrees of freedom
## Residual deviance: 237456  on 47744  degrees of freedom
## AIC: 237480
## 
## Number of Fisher Scoring iterations: 25
# suicide is TRUE, murder is FALSE

q4_test_data %>% mutate(prediction = predict.glm(q4_logit_model,q4_test_data ,type = 'response')) %>%
  select(prediction, MannerOfDeath) %>% mutate(prediction_end = prediction > 0.5) -> q4_calc_data

q4_calc_data %>% summarise(P = sum(prediction_end),
                           N = n() -  sum(prediction_end),
                           TP = sum(prediction_end & MannerOfDeath),
                           TN = sum(!prediction_end & !MannerOfDeath),
                           FP = sum(prediction_end & !MannerOfDeath),
                           FN = sum(!prediction_end & MannerOfDeath)) %>% 
  mutate(ACC = (TP+TN)/(P+N),
         FPR = 1- (TN/N),
         TPR = TP/P) %>% print
## # A tibble: 1 x 9
##       P     N    TP    TN    FP    FN   ACC    FPR   TPR
##   <int> <int> <int> <int> <int> <int> <dbl>  <dbl> <dbl>
## 1  9222  2718  8452  2620   770    98 0.927 0.0361 0.917
table(q4_test_data$MannerOfDeath, ifelse(predict.glm(q4_logit_model, q4_test_data, type = 'response')>0.5,1,0)) %>%
  plot()

cm_info <- ConfusionMatrixInfo(data = q4_test_data, predict = "prediction",actual = "MannerOfDeath", cutoff = .5 )
cm_info$plot
## Warning: Removed 5864 rows containing missing values (geom_point).

accuracy_info <- AccuracyCutoffInfo(train = q4_model_data, test = q4_test_data, predict = "prediction", actual = "MannerOfDeath")
accuracy_info$plot

calculate_acc <- function(data, cutoff) {
  P <- sum(data$prediction > cutoff)
  N <- sum(data$prediction <= cutoff)
  TP <- sum((data$prediction > cutoff) & (data$MannerOfDeath == 1))
  TN <- sum(!(data$prediction > cutoff) & !(data$MannerOfDeath == 1))
  
  ACC <- (TP+TN)/(P + N)
  
  return(ACC)
}
draw_acc_diag <- function(gaps) {
  sapply(seq(0, 1, gaps), function(x) {calculate_acc(q4_test_data, x)}) -> all_test_acc_vals
  sapply(seq(0, 1, gaps), function(x) {calculate_acc(q4_model_data, x)}) -> all_model_acc_vals
  rbind(
    cbind(
      seq(0, 1, gaps) %>% as.data.frame %>% rename(cutoff = '.'),
      accuracy = all_test_acc_vals,
      data = 'test'
    )
    ,
    cbind(
      seq(0, 1, gaps) %>% as.data.frame %>% rename(cutoff = '.'),
      accuracy = all_model_acc_vals,
      data = 'train'
    )
  ) %>% hchart(type = 'line', hcaes(x = cutoff, y = accuracy, group = data))
}

draw_acc_diag(0.005)

6. Plot the model accuracy against different cutoff values for the test data. Which cutoff parameter yields the highest prediction accuracy?


This diagram was already drawn as the last plot of the previous question! :D
Based on the diagram, the best cutoff value is about 0.443.

###############################################################################################
## Q6

# draw_acc_diag(0.0005)
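The commented call above redraws the curve on a finer grid; the optimum can also be read off numerically rather than visually. A minimal sketch, reusing calculate_acc() and q4_test_data from Q5 (the exact optimum will vary with the random train/test split):

```r
# Scan a fine cutoff grid and pick the one with the highest test accuracy.
cutoffs <- seq(0, 1, by = 0.0005)
test_acc <- sapply(cutoffs, function(x) calculate_acc(q4_test_data, x))
cutoffs[which.max(test_acc)]  # best cutoff for the test data
```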


7. Draw the ROC curve for the data of the previous part, and mark the point corresponding to the best cutoff parameter.


The ROC diagram is drawn and can be found below.
Based on this diagram, the best cutoff value is again about 0.443.

###############################################################################################
## Q7

cost_fp = 100
cost_fn = 200
roc_info = ROCInfo(data = cm_info$data, predict = "predict", 
                    actual = "actual", cost.fp = cost_fp, cost.fn = cost_fn)
grid.draw(roc_info$plot)


8. Build the model with H2O, setting nfolds = 5, and evaluate the result.

To build the generalized linear model with the h2o package, we first converted the data into the format the package expects.
We then fitted the model using h2o.glm.
The model statistics show very high accuracy.

###############################################################################################
## Q8

library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 51 minutes 
##     H2O cluster timezone:       Asia/Tehran 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.19.0.4257 
##     H2O cluster version age:    19 days  
##     H2O cluster name:           H2O_started_from_R_mohammadmahdi_rem508 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.64 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.4 (2018-03-15)
q8_data <- as.h2o(q5_data)
h2o_model <- h2o.glm(y = "MannerOfDeath", x= colnames(q8_data),
                training_frame = q8_data, family="binomial" ,nfolds = 5)
## Warning in .verify_dataxy(training_frame, x, y): removing response variable
## from the explanatory variables
h2o_model %>% print
## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model ID:  GLM_model_R_1524826888756_55 
## GLM Model: summary
##     family  link                                regularization
## 1 binomial logit Elastic Net (alpha = 0.5, lambda = 5.018E-4 )
##   number_of_predictors_total number_of_active_predictors
## 1                         11                          11
##   number_of_iterations training_frame
## 1                    6        q5_data
## 
## Coefficients: glm coefficients
##                    names coefficients standardized_coefficients
## 1              Intercept    50.357145                  1.708069
## 2         ResidentStatus     0.080617                  0.043723
## 3  Education2003Revision     0.167756                  0.324389
## 4                    Sex     0.317856                  0.131790
## 5                    Age     0.017433                  0.326704
## 6          MaritalStatus    -0.145887                 -0.145127
## 7    MethodOfDisposition     0.125112                  0.221123
## 8                Autopsy    -1.326282                 -1.226362
## 9           ActivityCode    -0.262539                 -3.221947
## 10         PlaceOfInjury    -0.156193                 -2.280730
## 11             Icd10Code    -0.216646                 -3.651898
## 12         CauseRecode39     0.080348                  0.094597
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.02031003
## RMSE:  0.1425133
## LogLoss:  0.1629414
## Mean Per-Class Error:  0.02279234
## AUC:  0.986859
## Gini:  0.973718
## R^2:  0.8986938
## Residual Deviance:  21305.49
## AIC:  21329.49
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            0     1    Error        Rate
## 0      15918   646 0.039000  =646/16564
## 1        284 42848 0.006584  =284/43132
## Totals 16202 43494 0.015579  =930/59696
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.447941 0.989264 216
## 2                       max f2  0.378976 0.992744 232
## 3                 max f0point5  0.620604 0.989095 182
## 4                 max accuracy  0.461124 0.984421 213
## 5                max precision  0.963038 0.993562  50
## 6                   max recall  0.000041 1.000000 399
## 7              max specificity  0.999686 0.994446   0
## 8             max absolute_mcc  0.461124 0.961001 213
## 9   max min_per_class_accuracy  0.679682 0.977300 170
## 10 max mean_per_class_accuracy  0.555405 0.979638 196
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.02066579
## RMSE:  0.143756
## LogLoss:  0.1660724
## Mean Per-Class Error:  0.02252939
## AUC:  0.9866897
## Gini:  0.9733793
## R^2:  0.8969193
## Residual Deviance:  21676.75
## AIC:  21700.75
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##            0     1    Error        Rate
## 0      15939   625 0.037732  =625/16564
## 1        316 42816 0.007326  =316/43132
## Totals 16255 43441 0.015763  =941/59696
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.460862 0.989131 221
## 2                       max f2  0.383037 0.992572 238
## 3                 max f0point5  0.614252 0.988985 189
## 4                 max accuracy  0.460862 0.984237 221
## 5                max precision  0.939405 0.993349  66
## 6                   max recall  0.000018 1.000000 399
## 7              max specificity  0.999641 0.994868   0
## 8             max absolute_mcc  0.460862 0.960540 221
## 9   max min_per_class_accuracy  0.678087 0.976878 175
## 10 max mean_per_class_accuracy  0.573033 0.979460 197
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                  mean          sd  cv_1_valid  cv_2_valid  cv_3_valid
## accuracy   0.98458874 5.956392E-4  0.98386693  0.98522586   0.9850883
## auc         0.9866854 6.315878E-4  0.98773897  0.98779094  0.98619944
## err       0.015411257 5.956392E-4 0.016133077 0.014774166 0.014911696
## err_count       184.0   7.1554174       193.0       175.0       179.0
## f0point5    0.9877865 4.063343E-4   0.9879097   0.9877563    0.988348
##            cv_4_valid  cv_5_valid
## accuracy   0.98545027  0.98331237
## auc         0.9859518  0.98574597
## err       0.014549712 0.016687632
## err_count       174.0       199.0
## f0point5   0.98820263  0.98671573
## 
## ---
##                         mean           sd cv_1_valid cv_2_valid cv_3_valid
## precision          0.9867373 4.8671348E-4  0.9872102  0.9864017 0.98746115
## r2                 0.8969126 0.0013266645  0.8964911  0.8991272 0.89815253
## recall             0.9920066 6.4188975E-4  0.9907174  0.9932124 0.99191123
## residual_deviance  4335.3506    230.07188  4615.1147   4451.337  4707.6543
## rmse              0.14374892  9.606604E-4 0.14293139 0.14238524 0.14314623
## specificity          0.96528  0.001240069 0.96540004  0.9645454 0.96746266
##                   cv_4_valid cv_5_valid
## precision           0.987061  0.9855525
## r2                 0.8971822  0.8936099
## recall             0.9927957 0.99139637
## residual_deviance  3906.3455  3996.3013
## rmse              0.14403123 0.14625049
## specificity        0.9665971  0.9623947

9. Could we offer judges a service that uses death-record information to distinguish suspected murders from suicides?


The model we developed in this homework has very high accuracy and a very low error rate.
The high R^2 value indicates that the model describes the data distribution well.
Also, the accuracy-versus-cutoff diagram suggests that we have not overfitted the model during training.

So, based on this information, I think the model is powerful enough to be used even in courts of law.
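As a sketch of how such a service might be consulted, a single death record can be scored with the Q3 model. Every feature value below is invented purely for illustration; the numeric codes stand for the factor coding used in q3_data and carry no real meaning here:

```r
# Sketch: scoring one hypothetical death record with the final Q3 model.
# All feature values are made up for illustration only.
new_case <- data.frame(ResidentStatus = 1, Education1989Revision = 0,
                       Education2003Revision = 3, Sex = 2, Age = 45,
                       MaritalStatus = 2, InjuryAtWork = 1,
                       MethodOfDisposition = 2, Autopsy = 1,
                       ActivityCode = 99, PlaceOfInjury = 0,
                       Icd10Code = 100, CauseRecode39 = 40)
predict(q3_logit_model_2, newdata = new_case, type = 'response')
# a fitted probability near 1 suggests suicide, near 0 suggests murder
```

In practice such a score should only flag cases for human review, not decide them.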